This document is an exploratory data analysis of the Red Wines dataset. The dataset contains chemical and physical properties of wines, unique ids, and a quality rating assigned by professionals.
## [1] 1599 13
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
## Observations: 1,599
## Variables: 13
## $ X (int) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13...
## $ fixed.acidity (dbl) 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, ...
## $ volatile.acidity (dbl) 0.700, 0.880, 0.760, 0.280, 0.700, 0.660,...
## $ citric.acid (dbl) 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06,...
## $ residual.sugar (dbl) 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2...
## $ chlorides (dbl) 0.076, 0.098, 0.092, 0.075, 0.076, 0.075,...
## $ free.sulfur.dioxide (dbl) 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15...
## $ total.sulfur.dioxide (dbl) 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, ...
## $ density (dbl) 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0...
## $ pH (dbl) 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30,...
## $ sulphates (dbl) 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46,...
## $ alcohol (dbl) 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, ...
## $ quality (int) 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5,...
The dataset has 1,599 records and 13 variables.
Let’s look at histograms of all variables in the dataset (except X).
The distribution of total sulfur dioxide looks approximately normal on a log10 scale. I’ll add log10 of total.sulfur.dioxide to the dataset for future use.
Let’s also add another variable, non-free sulfur dioxide, which is the difference between total and free sulfur dioxide.
Its distribution is almost the same as that of total sulfur dioxide.
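The two derived columns described above could be built roughly as follows. This is a minimal base R sketch, not the original code: the data frame name `wines` is an assumption, and the toy rows are just the first ten values shown in the structure listing earlier. The column names match those that appear later in the correlation output.

```r
# First ten values of each column, copied from the structure listing above.
# The real analysis would use the full data frame (the name `wines` is assumed).
wines <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)
)

# log10 of total sulfur dioxide, kept as its own column for later use
wines$log10.total.sulfur.dioxide <- log10(wines$total.sulfur.dioxide)

# Non-free sulfur dioxide = total minus free
wines$nonfree.sulfur.dioxide <-
  wines$total.sulfur.dioxide - wines$free.sulfur.dioxide

head(wines, 3)
```

Because both new columns are functions of total.sulfur.dioxide, they will be strongly correlated with it by construction.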
Because several variables have heavy tails, it’s interesting to look at their distributions without these tails.
There are 1599 wines with 12 features (“fixed.acidity”, “volatile.acidity”, “citric.acid”, “residual.sugar”, “chlorides”, “free.sulfur.dioxide”, “total.sulfur.dioxide”, “density”, “pH”, “sulphates”, “alcohol”, “quality”) and X (id) labels.
Most wines have quality 5 or 6 (neutral?). Only a few wines have quality 8, and even fewer have quality 3. There are no wines with quality > 8 or quality < 3.
There are no wines with less than 8.4% alcohol. Most wines have more than 9% and less than 13% alcohol.
Normal distributions:
Close to normal distributions, with some outliers in the right tail:
Left-skewed distributions:
The most interesting features in this dataset are the quality of the wine and its basic chemical characteristics (alcohol, pH, acids, sulphates).
I think it will be interesting to figure out why good wine is good and why bad wine is bad. Because most wines have average quality (5 or 6), I’m going to look in detail at wines with low quality (3 and 4) and high quality (7 and 8).
I added non-free sulfur dioxide, but it does not seem very helpful. I also added total sulfur dioxide on a log10 scale, because it may be interesting for future investigation.
There were a number of parameters with heavy tails. Replotting them without the top 5% of values made the underlying distributions easier to see.
I drew histograms of the values below the 95th percentile for total sulfur dioxide, sulphates, chlorides, volatile acidity and residual sugar. Sulphates, volatile acidity and fixed acidity have distributions close to normal. Removing the largest 5% of values made the distributions much closer to normal. The same procedure made the chlorides histogram look normal. It would be interesting to take a closer look at these outliers.
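Dropping the top 5% of values before plotting can be sketched like this. The helper name `trim_upper_tail` is hypothetical, and the sample is simulated right-skewed data rather than the wine measurements. It only illustrates the trimming step, not the original plotting code.

```r
# Hypothetical helper: keep only values at or below the given upper quantile
trim_upper_tail <- function(x, prob = 0.95) {
  cutoff <- quantile(x, probs = prob, na.rm = TRUE)
  x[!is.na(x) & x <= cutoff]
}

# Simulated right-skewed sample (NOT the wine data), standing in for a
# heavy-tailed variable such as chlorides or residual.sugar
set.seed(1)
x <- rlnorm(1000, meanlog = 0, sdlog = 1)

trimmed <- trim_upper_tail(x)
length(trimmed) / length(x)   # roughly 0.95 of the values remain

# hist(trimmed, breaks = 30) would then draw the distribution without the tail
```

A quantile cutoff removes a fixed share of records whatever the variable's scale, so the same call works for every heavy-tailed column.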
It seems that citric acid has a number of zero values. Let’s calculate the number of zeros and the percentage of such records.
## [1] 132 16
## [1] 0.08255159 1.00000000
There are 132 records (8.3% of wines) that have citric acid equal to zero. According to this article:
Citric acid is often added to wines to increase acidity, complement a specific flavor or prevent ferric hazes. It can be added to finished wines to increase acidity and give a “fresh” flavor.
So it is interesting to examine the relationship between citric acid and wine quality.
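The zero count reported above can be reproduced along these lines. This is a sketch on the first seven citric.acid values from the structure listing, not the original code. The original output also shows a column count, which suggests it came from `dim()` on a subset, but that reading is an assumption.

```r
# First seven citric.acid values, copied from the structure listing above
citric.acid <- c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06)

is_zero  <- citric.acid == 0
n_zero   <- sum(is_zero)    # number of records with no citric acid
pct_zero <- mean(is_zero)   # share of such records (proportion, not percent)

c(count = n_zero, share = pct_zero)
```

On the full dataset the same two quantities are the 132 and 0.0826 shown in the output above.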
I plotted several variables on a log10 scale and found that the distribution of total sulfur dioxide looks approximately normal on this scale.
Let’s start with correlations between our parameters.
## X fixed.acidity volatile.acidity
## X 1.000000000 -0.26848392 -0.008815099
## fixed.acidity -0.268483920 1.00000000 -0.256130895
## volatile.acidity -0.008815099 -0.25613089 1.000000000
## citric.acid -0.153551355 0.67170343 -0.552495685
## residual.sugar -0.031260835 0.11477672 0.001917882
## chlorides -0.119868519 0.09370519 0.061297772
## free.sulfur.dioxide 0.090479643 -0.15379419 -0.010503827
## total.sulfur.dioxide -0.117849669 -0.11318144 0.076470005
## density -0.368372087 0.66804729 0.022026232
## pH 0.136005328 -0.68297819 0.234937294
## sulphates -0.125306999 0.18300566 -0.260986685
## alcohol 0.245122841 -0.06166827 -0.202288027
## quality 0.066452608 0.12405165 -0.390557780
## log10.total.sulfur.dioxide -0.122541052 -0.11789982 0.073407103
## nonfree.sulfur.dioxide -0.178263036 -0.07814929 0.097033939
## citric.acid residual.sugar chlorides
## X -0.153551355 -0.031260835 -0.119868519
## fixed.acidity 0.671703435 0.114776724 0.093705186
## volatile.acidity -0.552495685 0.001917882 0.061297772
## citric.acid 1.000000000 0.143577162 0.203822914
## residual.sugar 0.143577162 1.000000000 0.055609535
## chlorides 0.203822914 0.055609535 1.000000000
## free.sulfur.dioxide -0.060978129 0.187048995 0.005562147
## total.sulfur.dioxide 0.035533024 0.203027882 0.047400468
## density 0.364947175 0.355283371 0.200632327
## pH -0.541904145 -0.085652422 -0.265026131
## sulphates 0.312770044 0.005527121 0.371260481
## alcohol 0.109903247 0.042075437 -0.221140545
## quality 0.226372514 0.013731637 -0.128906560
## log10.total.sulfur.dioxide -0.003637462 0.147471411 0.060221933
## nonfree.sulfur.dioxide 0.066776040 0.174529035 0.055479649
## free.sulfur.dioxide total.sulfur.dioxide
## X 0.090479643 -0.11784967
## fixed.acidity -0.153794193 -0.11318144
## volatile.acidity -0.010503827 0.07647000
## citric.acid -0.060978129 0.03553302
## residual.sugar 0.187048995 0.20302788
## chlorides 0.005562147 0.04740047
## free.sulfur.dioxide 1.000000000 0.66766645
## total.sulfur.dioxide 0.667666450 1.00000000
## density -0.021945831 0.07126948
## pH 0.070377499 -0.06649456
## sulphates 0.051657572 0.04294684
## alcohol -0.069408354 -0.20565394
## quality -0.050656057 -0.18510029
## log10.total.sulfur.dioxide 0.713535755 0.92313740
## nonfree.sulfur.dioxide 0.425148917 0.95768634
## density pH sulphates
## X -0.36837209 0.13600533 -0.125306999
## fixed.acidity 0.66804729 -0.68297819 0.183005664
## volatile.acidity 0.02202623 0.23493729 -0.260986685
## citric.acid 0.36494718 -0.54190414 0.312770044
## residual.sugar 0.35528337 -0.08565242 0.005527121
## chlorides 0.20063233 -0.26502613 0.371260481
## free.sulfur.dioxide -0.02194583 0.07037750 0.051657572
## total.sulfur.dioxide 0.07126948 -0.06649456 0.042946836
## density 1.00000000 -0.34169933 0.148506412
## pH -0.34169933 1.00000000 -0.196647602
## sulphates 0.14850641 -0.19664760 1.000000000
## alcohol -0.49617977 0.20563251 0.093594750
## quality -0.17491923 -0.05773139 0.251397079
## log10.total.sulfur.dioxide 0.10553948 -0.01483664 0.069754799
## nonfree.sulfur.dioxide 0.09513464 -0.10805328 0.032244043
## alcohol quality
## X 0.24512284 0.06645261
## fixed.acidity -0.06166827 0.12405165
## volatile.acidity -0.20228803 -0.39055778
## citric.acid 0.10990325 0.22637251
## residual.sugar 0.04207544 0.01373164
## chlorides -0.22114054 -0.12890656
## free.sulfur.dioxide -0.06940835 -0.05065606
## total.sulfur.dioxide -0.20565394 -0.18510029
## density -0.49617977 -0.17491923
## pH 0.20563251 -0.05773139
## sulphates 0.09359475 0.25139708
## alcohol 1.00000000 0.47616632
## quality 0.47616632 1.00000000
## log10.total.sulfur.dioxide -0.23085802 -0.17014272
## nonfree.sulfur.dioxide -0.22320257 -0.20546298
## log10.total.sulfur.dioxide
## X -0.122541052
## fixed.acidity -0.117899816
## volatile.acidity 0.073407103
## citric.acid -0.003637462
## residual.sugar 0.147471411
## chlorides 0.060221933
## free.sulfur.dioxide 0.713535755
## total.sulfur.dioxide 0.923137400
## density 0.105539483
## pH -0.014836642
## sulphates 0.069754799
## alcohol -0.230858016
## quality -0.170142719
## log10.total.sulfur.dioxide 1.000000000
## nonfree.sulfur.dioxide 0.846502529
## nonfree.sulfur.dioxide
## X -0.17826304
## fixed.acidity -0.07814929
## volatile.acidity 0.09703394
## citric.acid 0.06677604
## residual.sugar 0.17452903
## chlorides 0.05547965
## free.sulfur.dioxide 0.42514892
## total.sulfur.dioxide 0.95768634
## density 0.09513464
## pH -0.10805328
## sulphates 0.03224404
## alcohol -0.22320257
## quality -0.20546298
## log10.total.sulfur.dioxide 0.84650253
## nonfree.sulfur.dioxide 1.00000000
Most parameters seem to be uncorrelated. There is a correlation of about 0.67 between fixed.acidity and citric.acid, and almost the same correlation between free.sulfur.dioxide and total.sulfur.dioxide. There is a negative correlation of about -0.68 between pH and fixed.acidity. The highest correlation of quality with any parameter is with alcohol, and it is just 0.48. So quality does not depend directly on any single physicochemical parameter of the wine.
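A correlation table like the one above, and a ranking of the features by their correlation with quality, could be produced roughly as follows. The sketch uses base R `cor()` (Pearson by default) on the first seven rows of four columns from the structure listing. The name `wines` and the choice of columns are assumptions made for illustration.

```r
# First seven rows of four columns, copied from the structure listing above
wines <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9),
  citric.acid   = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06),
  pH            = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30),
  quality       = c(5, 5, 5, 6, 5, 5, 5)
)

# Pairwise Pearson correlations between all numeric columns
cors <- cor(wines)
round(cors, 2)

# Correlations of each feature with quality, strongest first by absolute value
q <- cors["quality", colnames(cors) != "quality"]
q[order(abs(q), decreasing = TRUE)]
```

Sorting by absolute value puts strong negative correlations, such as the one with volatile.acidity in the full table, next to strong positive ones.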
Let’s take a look at the relationships mentioned above.
The first two graphs don’t show anything new; the correlation is clearly visible. As mentioned in Wikipedia:
pH is defined as the decimal logarithm of the reciprocal of the hydrogen ion activity
So pH should correlate better with log10 of fixed.acidity.
## [1] -0.7063602
Now the negative correlation is even stronger: -0.706.
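The comparison between the raw and log-transformed acidity can be sketched as below, again on the first seven rows from the structure listing rather than the full data. With so few rows the two coefficients are only illustrative. The -0.683 and -0.706 quoted above come from the full dataset.

```r
# First seven values of each column, copied from the structure listing above
fixed.acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9)
pH            <- c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30)

# Pearson correlation of pH with the raw and the log10-transformed acidity
r_raw <- cor(fixed.acidity, pH)
r_log <- cor(log10(fixed.acidity), pH)

c(raw = r_raw, log10 = r_log)
```

Since pH is itself a logarithmic quantity, a roughly linear relation with log10 of an acid concentration is plausible, which is what the transformed correlation checks.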
Although the correlation is weak, the outliers from the main group show that sweet wines have less alcohol and, conversely, wines that contain more alcohol have less sugar, because sugar is converted to alcohol during winemaking. There is one outlier: a wine that contains 14.9% alcohol and has a residual sugar value of about 8. Maybe some additional alcohol was added to this wine during production.
Because quality is the most interesting feature, let’s take a look at the quality plots.
Quality has low correlation with the other parameters. It has some correlation with alcohol, which is fun, but this correlation is not strong. On the boxplot of quality vs. citric acid it is easy to see that wines that contain more citric acid tend to have higher quality, but wines with extreme amounts of this acid have low quality. There is also an obvious dependency of quality on sulphates: the best wines have a larger median value. Quality is negatively correlated with volatile.acidity, which is low for the best wines.
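The per-quality medians that a boxplot displays can also be computed directly. This is a small base R sketch on the first seven sulphates and quality values from the structure listing. The toy sample contains only quality 5 and 6, so it illustrates the mechanics rather than the finding.

```r
# First seven values of each column, copied from the structure listing above
sulphates <- c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46)
quality   <- c(5, 5, 5, 6, 5, 5, 5)

# Median sulphates within each quality level (the centre line of each box)
med <- tapply(sulphates, quality, median)
med

# boxplot(sulphates ~ quality) would draw the corresponding boxes
```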
I found a good negative relationship between the logarithm of fixed acidity and pH, although this is to be expected from the definition of pH.
There is a correlation of about 0.67 between fixed.acidity and citric.acid, and almost the same correlation between free.sulfur.dioxide and total.sulfur.dioxide. Neither is very strong. Another observation is that the best wines have slightly higher average alcohol. I didn’t mention the correlations between nonfree.sulfur.dioxide, log10.total.sulfur.dioxide and total.sulfur.dioxide, because the first two features were built from the last one. As mentioned earlier, the strongest correlation is between log10 of fixed.acidity and pH: -0.706.
Let’s start with scatter plots of the features that have a visible influence on quality: volatile.acidity, sulphates, alcohol and citric.acid.
Now let’s look at the same plots, but only for good and bad wines, where quality < 5 or quality > 6.
## [1] 280 17
After removing the average wines, 280 records remain in the dataset.
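Filtering out the average wines might look like the sketch below, which uses the first ten ids and quality values from the structure listing. The `rating` label column is hypothetical. The original output does show one extra column after filtering, but what it contained is not stated.

```r
# First ten ids and quality scores, copied from the structure listing above
wines <- data.frame(
  X       = 1:10,
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Keep only clearly bad (quality < 5) and clearly good (quality > 6) wines
extremes <- subset(wines, quality < 5 | quality > 6)

# Hypothetical label column for colouring or faceting plots
extremes$rating <- ifelse(extremes$quality > 6, "good", "bad")

extremes
```

Note the `|` (or) in the condition: no wine can be both below 5 and above 6, so combining the two tests with `&` would return an empty data frame.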
When I plotted the 4 previously selected features (volatile.acidity, sulphates, alcohol and citric.acid) against quality, I found that the dependencies are clear on multivariate graphs. Because my main interest was the dependence of quality on these features, I filtered out wines of quality 5 and 6 to see clearly which wines are good and which are bad. The last set of plots shows clearly visible groups of parameter values for good and bad wines.
It is interesting that the amounts of different acids (citric, acetic and tartaric) determine the taste and quality of wine. For example, the best wines tend to have citric acid of 0.25-0.5 g/dm^3 but volatile acidity (acetic acid) below 0.4 g/dm^3. Increasing citric acid will not make a wine bad, but a lot of acetic acid easily will.
Distribution of wine quality. It is very close to normal. Experts avoid using low (< 3) and high (> 8) marks. Why are ratings 0-2 and 9-10 not present? I think it’s possible that an expert will give some wine a rating of 1 or 9, but these ratings are medians of at least 3 evaluations made by wine experts, and the chance that all experts give the same (very high or very low) score is really small. To me this demonstrates that the wine quality metric depends heavily on personal taste, so it will be hard to find strong correlations or build an analytical model.
This plot shows 4 characteristics of the selected wines (quality 4 and less or 7 and more). It’s clear that good wines usually have lower volatile acidity and a higher amount of sulphates. Another noticeable point is that bad wines have a smaller amount of alcohol.
All 4 of these parameters can provide some information about the quality of wine. So, to be the best, a wine should:
However, it’s not so easy to build a model of bad wine, because the parameters vary more widely for bad wines.
The dataset contains information about 1599 wine samples, with quality ratings assigned by experts. The data is complete, without visible errors. Although there are not many correlations between characteristics, and most characteristics are not directly related to quality, it’s possible to find some differences between good and bad wines, while there is not enough information to say anything important about average wines.
I think it will be hard to build an analytical model to predict wine quality, because the classes are not linearly separable in these parameters and there is a large component of personal taste in the quality rating. Still, it is possible to get some intuition about the quality of a wine from the amounts of acids, sulphates and alcohol in each sample.
I found it interesting that fixed acidity (tartaric acid, the main component of wine acidity) on a log10 scale has a good negative correlation with pH, which is itself a log10 quantity.
I never thought about acids as an important part of wine taste. It would be interesting to see what happens if I add some lemon juice (which is 6-8% citric acid) to a red wine with low acidity. That will be my next experiment.